Final Project: Time Series Forecasting with LSTMs, Neural Networks Engineering Class
Prof. Victor H. Alves Ribeiro, PPGEPS/PUCPR
Note
The exercise prompt in English goes here
Abstract
This project aims to predict the returns of a selected portfolio of agricultural commodities using Long Short-Term Memory (LSTM) neural networks. The study includes ablation experiments to understand the impact of different architectural choices on model performance. The results demonstrate the importance of hyperparameter tuning and regularization in time series forecasting tasks.
Introduction
The use of Artificial Neural Networks (ANNs) for time series forecasting has gained significant traction in recent years. This project focuses on utilizing LSTMs, a type of recurrent neural network capable of handling sequential data efficiently, to predict financial returns of agricultural commodities. A detailed ablation study is conducted to explore various architectural configurations and hyperparameters.
Literature Review
Time series forecasting is critical in finance and economics. Traditional models, such as ARIMA and exponential smoothing, have limitations when dealing with non-linear and complex data. Recent studies emphasize the robustness of LSTM models in capturing temporal dependencies. This project builds on existing research by applying LSTM networks to a unique dataset of agricultural commodity returns.
…see the full paper for this complete section …
Methods
The methodology comprises the following steps:
Collecting commodity price data.
Calculating logarithmic returns.
Normalizing the data.
Training LSTM models with different configurations.
Performing grid search to optimize hyperparameters.
Conducting residual analysis to identify uncaptured patterns and issues like autocorrelation or heteroscedasticity.
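As a concrete illustration of the log-return and normalization steps listed above, here is a minimal sketch using a hypothetical four-point price series (the `MinMaxScaler` is the same transformer used later in the project):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices, for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 103.0])

# Logarithmic returns: r_t = ln(P_t / P_{t-1}); the first value is NaN and is dropped
log_returns = np.log(prices / prices.shift(1)).dropna()

# Min-max normalization to [0, 1], as done before feeding the LSTM
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(log_returns.values.reshape(-1, 1))
```

Note that the scaler must be kept around so predictions can later be inverted back to the return scale, which is exactly why the pipeline below stores one scaler per series.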
Data Collection and Preprocessing
Python libraries
Code
# Importing necessary libraries
import pandas as pd
import numpy as np
# import yfinance as yf
from yahooquery import Ticker
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import ParameterGrid
import psutil
import time
import json
import os
import warnings
warnings.filterwarnings('ignore')

# Additional libraries for residual analysis
from statsmodels.stats.diagnostic import acorr_ljungbox, het_breuschpagan
import statsmodels.api as sm

# Timer for total execution time (used at the end of the document)
from datetime import datetime
start_time = datetime.now()
First of all, we need to check the hardware availability:
Code
# Collecting hardware information
def get_system_info():
    system_info = {
        'CPU_cores': psutil.cpu_count(logical=True),
        'CPU_freq_MHz': psutil.cpu_freq().current,
        'Total_RAM_GB': round(psutil.virtual_memory().total / (1024**3), 2),
        'Available_RAM_GB': round(psutil.virtual_memory().available / (1024**3), 2),
        'GPU_info': 'Not available'  # Placeholder; can be expanded with libraries like GPUtil
    }
    return system_info

system_info = get_system_info()
print("System Information:", system_info)
from yahooquery import Ticker
import pandas as pd
import numpy as np

# Defining the tickers
tickers = [
    "ZC=F",  # Corn Futures
    "ZW=F",  # Wheat Futures
    "KE=F",  # KC HRW Wheat Futures
    "ZR=F",  # Rough Rice Futures
    "GF=F",  # Feeder Cattle Futures
    "ZM=F",  # Soybean Meal Futures
    "ZL=F",  # Soybean Oil Futures
    "ZS=F"   # Soybean Futures
]

print("\nDownloading price data using yahooquery...")
ticker_obj = Ticker(tickers)
data = ticker_obj.history(start="2015-01-01")

# If the DataFrame has a MultiIndex (with 'symbol' and 'date'), reset the index and pivot
if isinstance(data.index, pd.MultiIndex):
    data = data.reset_index()
    if 'close' in data.columns:
        data = data.pivot(index='date', columns='symbol', values='close')
    else:
        print("Column 'close' not found in data.")
else:
    # Without a MultiIndex, check whether the 'close' column exists
    data = data['close'] if 'close' in data.columns else data

# Convert the index to datetime (tz-aware in UTC, then drop the timezone info, leaving it tz-naive)
data.index = pd.to_datetime(data.index, utc=True).tz_convert(None)

# If data is a Series, convert it to a DataFrame
if isinstance(data, pd.Series):
    data = data.to_frame()

# Handling missing data
print("\nHandling missing data...")
data.fillna(method='ffill', inplace=True)     # Forward fill
data.dropna(axis=1, how='all', inplace=True)  # Remove columns that are all NaN
data.dropna(axis=0, how='any', inplace=True)  # Remove rows with any NaN

# Checking the data
print("\nData columns and their non-null counts:")
print(data.count())
if data.empty:
    print("Data is empty after cleaning. Exiting.")
    exit()

# Calculating the logarithmic returns
returns = np.log(data / data.shift(1)).dropna()

# Checking the returns
print("\nReturns DataFrame info:")
print(returns.info())
print(returns.head())
if returns.empty:
    print("Returns DataFrame is empty. Exiting.")
    exit()

returns.head()  # Displaying the time series used (no additional features)
Plotting the time series of prices and returns side by side (two per row):
Code
# Create a directory for plots if it doesn't exist
plots_dir = 'plots'
if not os.path.exists(plots_dir):
    os.makedirs(plots_dir)

# Plot prices
print("\nPlotting time series of prices...")
num_cols = 2  # Number of plots per row
num_plots = len(data.columns)
num_rows = (num_plots + num_cols - 1) // num_cols  # Ensure enough rows
fig, axs = plt.subplots(num_rows, num_cols, figsize=(15, 5 * num_rows))
axs = axs.flatten()
for i, col in enumerate(data.columns):
    axs[i].plot(data.index, data[col])
    axs[i].set_title(f'Price Series - {col}')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Price')
# Hide unused subplots
for j in range(i + 1, len(axs)):
    fig.delaxes(axs[j])
plt.tight_layout()
plt.savefig(os.path.join(plots_dir, 'price_series.png'))
plt.show()
plt.close()

# Plot returns
print("Plotting time series of returns...")
num_plots_ret = len(returns.columns)
num_rows_ret = (num_plots_ret + num_cols - 1) // num_cols
fig, axs = plt.subplots(num_rows_ret, num_cols, figsize=(15, 5 * num_rows_ret))
axs = axs.flatten()
for i, col in enumerate(returns.columns):
    axs[i].plot(returns.index, returns[col])
    axs[i].set_title(f'Return Series - {col}')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Log Return')
# Hide unused subplots
for j in range(i + 1, len(axs)):
    fig.delaxes(axs[j])
plt.tight_layout()
plt.savefig(os.path.join(plots_dir, 'return_series.png'))
plt.show()
plt.close()
Plotting time series of prices...
Plotting time series of returns...
Preprocessing data for LSTM time series modelling:
Code
# Function to prepare data for LSTM
def prepare_data(series, time_steps):
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:(i + time_steps)])
        y.append(series[i + time_steps])
    return np.array(X), np.array(y)
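As a quick sanity check on the windowing logic, consider a toy series of eight consecutive values (this example is illustrative only and is not part of the original pipeline):

```python
import numpy as np

def prepare_data(series, time_steps):
    # Sliding windows: each X row holds time_steps consecutive values,
    # and the corresponding y entry is the value that immediately follows
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:(i + time_steps)])
        y.append(series[i + time_steps])
    return np.array(X), np.array(y)

toy = np.arange(8, dtype=float)          # [0.0, 1.0, ..., 7.0]
X, y = prepare_data(toy, time_steps=5)
# With 8 points and 5 time steps, 8 - 5 = 3 (window, target) pairs result;
# the first window is [0, 1, 2, 3, 4] and its target is 5.0
```

This shape (samples, time_steps) is then reshaped to (samples, time_steps, 1) before training, since Keras LSTMs expect a trailing feature dimension.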
Setting the parameters:
Code
# Defining parameters
time_steps = 5  # Number of time steps
epochs = 10     # Reduced epochs for faster execution during testing

# Dictionaries to store results
models = {}
histories = {}
mse_results = {}
scalers = {}
predictions = {}
best_params_dict = {}
residuals_analysis = {}

# Directory to save reports and graphs
report_dir = 'report'
if not os.path.exists(report_dir):
    os.makedirs(report_dir)
LSTM time series model fitting
Code
# Loop through each time series
for col in returns.columns:
    print(f"\nProcessing column: {col}")
    series = returns[col].values.reshape(-1, 1)

    # Check if series is empty
    if len(series) == 0:
        print(f"Series {col} is empty after preprocessing. Skipping.")
        continue
    print(f"Series {col} has {len(series)} data points.")

    # Normalizing data
    scaler = MinMaxScaler(feature_range=(0, 1))
    series_scaled = scaler.fit_transform(series)
    scalers[col] = scaler  # Storing the scaler for later inversion

    # Preparing data
    X, y = prepare_data(series_scaled, time_steps)

    # Check if X and y are non-empty
    if X.shape[0] == 0:
        print(f"Not enough data points in {col} after preparation. Skipping.")
        continue

    # Splitting into training and test sets
    split_index = int(0.8 * len(X))
    X_train_full, X_test = X[:split_index], X[split_index:]
    y_train_full, y_test = y[:split_index], y[split_index:]
    X_train_full = X_train_full.reshape((X_train_full.shape[0], X_train_full.shape[1], 1))
    X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

    # Hyperparameter grid for Grid Search
    param_grid = {
        'neurons': [30, 50],
        'learning_rate': [0.001, 0.01],
        'activation': ['tanh', 'relu'],
        'batch_size': [32, 64]
    }
    grid = ParameterGrid(param_grid)

    # Initializing variables to store best results
    best_mse = float('inf')
    best_params = None
    best_model = None

    # Performing Grid Search
    print(f"Performing Grid Search for {col}...")
    for params in grid:
        model = Sequential()
        model.add(LSTM(params['neurons'], activation=params['activation'],
                       input_shape=(time_steps, 1)))
        model.add(Dense(1))
        optimizer = Adam(learning_rate=params['learning_rate'])
        model.compile(optimizer=optimizer, loss='mean_squared_error')
        history = model.fit(
            X_train_full, y_train_full,
            validation_data=(X_test, y_test),
            epochs=epochs,
            batch_size=params['batch_size'],
            verbose=0
        )
        y_pred = model.predict(X_test)
        y_pred_inv = scaler.inverse_transform(y_pred)
        y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
        mse = mean_squared_error(y_test_inv, y_pred_inv)
        if mse < best_mse:
            best_mse = mse
            best_params = params
            best_model = model
            best_y_pred = y_pred

    best_params_dict[col] = best_params
    print(f"Best parameters for {col}: {best_params} with MSE: {best_mse}")
    models[col] = best_model
    predictions[col] = {'Best Model': best_y_pred}

    # Inverting the normalization
    y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
    y_pred_inv = scaler.inverse_transform(best_y_pred)

    # Calculating MSE
    mse_results[col] = {'Best Model': best_mse}

    # Visualization of results
    plt.figure(figsize=(10, 4))
    plt.plot(y_test_inv, label='Actual Value')
    plt.plot(y_pred_inv, label='Prediction')
    plt.title(f'Prediction vs Actual - {col} - Best Model')
    plt.legend()
    plt.savefig(os.path.join(report_dir, f'pred_vs_actual_{col}_Best_Model.png'))
    plt.close()

    # Residual Analysis
    residuals = y_test_inv - y_pred_inv

    # Plotting residuals
    plt.figure(figsize=(10, 4))
    plt.plot(residuals, label='Residuals')
    plt.title(f'Residuals - {col} - Best Model')
    plt.legend()
    plt.savefig(os.path.join(report_dir, f'residuals_{col}_Best_Model.png'))
    plt.close()

    # Ljung-Box test for autocorrelation in residuals
    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    lb_pvalue = lb_test['lb_pvalue'].values[0]

    # Plotting residuals ACF
    fig, ax = plt.subplots(figsize=(10, 4))
    sm.graphics.tsa.plot_acf(residuals.squeeze(), lags=40, ax=ax)
    plt.title(f'Residuals Autocorrelation Function - {col}')
    plt.savefig(os.path.join(report_dir, f'acf_residuals_{col}_Best_Model.png'))
    plt.close()

    # Heteroscedasticity test (Breusch-Pagan Test)
    exog = sm.add_constant(best_model.predict(X_test))
    test_bp = het_breuschpagan(residuals, exog)
    bp_pvalue = test_bp[3]

    # Convert p-values to Python float
    lb_pvalue = float(lb_pvalue)
    bp_pvalue = float(bp_pvalue)

    # Saving statistical test results
    residuals_analysis[col] = {
        'residuals': residuals.flatten().tolist(),
        'ljung_box_pvalue': lb_pvalue,
        'breusch_pagan_pvalue': bp_pvalue
    }
    print(f"Residual Analysis for {col}:")
    print(f"Ljung-Box Test p-value: {lb_pvalue}")
    print(f"Breusch-Pagan Test p-value: {bp_pvalue}")

# Displaying final results in a table
print("\nFinal Results:")
results_table = pd.DataFrame(mse_results)
print(results_table)
# Saving results to a CSV file
results_table.to_csv(os.path.join(report_dir, 'mse_results_updated.csv'), index=True)

# Saving the best parameters found
with open(os.path.join(report_dir, 'best_params.json'), 'w') as f:
    json.dump(best_params_dict, f, indent=4)

# Saving the residual analysis
with open(os.path.join(report_dir, 'residuals_analysis.json'), 'w') as f:
    json.dump(residuals_analysis, f, indent=4)
Plotting the MSEs for each time series:
Code
# Report: Documenting the results
# Plotting the MSEs for each time series
for col in mse_results.keys():
    mse_series = mse_results[col]
    plt.figure(figsize=(10, 5))
    plt.bar(mse_series.keys(), mse_series.values(), color='blue')
    plt.title(f'MSE Comparison - {col}')
    plt.ylabel('MSE')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(os.path.join(report_dir, f'mse_comparison_{col}.png'))
    plt.close()
Saving system info:
Code
# End timer (requires datetime imported and start_time set at the beginning of the run)
end_time = datetime.now()
elapsed_time = end_time - start_time  # This is a timedelta object
print(f"Total execution time: {elapsed_time}")

# Save execution time to the report
system_info['Execution_Time_seconds'] = elapsed_time.total_seconds()  # Convert to float for JSON
with open(os.path.join(report_dir, 'system_info.json'), 'w') as f:
    json.dump(system_info, f, indent=4)
Total execution time: 0:10:16.011476
Generating an automatic final report:
Code
# Final Report: Generating a text document with the results
report_path = os.path.join(report_dir, 'final_report.txt')
with open(report_path, 'w') as report_file:
    report_file.write("Final Project Report - Forecasting Commodity Returns with LSTM\n")
    report_file.write("=" * 80 + "\n\n")
    report_file.write("1. Project Objectives:\n")
    report_file.write("Forecast future returns of a commodity portfolio using LSTM Neural Networks.\n\n")
    report_file.write("2. Methodology:\n")
    report_file.write("- Collecting commodity price data.\n")
    report_file.write("- Calculating logarithmic returns.\n")
    report_file.write("- Normalizing the data.\n")
    report_file.write("- Training LSTM models with different configurations.\n")
    report_file.write("- Performing grid search to optimize hyperparameters.\n")
    report_file.write("- Conducting residual analysis to identify uncaptured patterns and issues like autocorrelation or heteroscedasticity.\n\n")
    report_file.write("3. Results:\n")
    report_file.write(results_table.to_string())
    report_file.write("\n\n")
    report_file.write("4. Best Parameters Found (Grid Search):\n")
    report_file.write(json.dumps(best_params_dict, indent=4))
    report_file.write("\n\n")
    report_file.write("5. Residual Analysis:\n")
    for col, res in residuals_analysis.items():
        report_file.write(f"Residual Analysis for {col}:\n")
        report_file.write(f"Ljung-Box Test p-value: {res['ljung_box_pvalue']}\n")
        report_file.write(f"Breusch-Pagan Test p-value: {res['breusch_pagan_pvalue']}\n\n")
    report_file.write("\n")
    report_file.write("6. Conclusions:\n")
    report_file.write("The study demonstrated the importance of proper hyperparameter selection and model architecture for forecasting financial returns. Regularization techniques and the choice of activation function significantly influenced model performance. The residual analysis highlighted the need to consider autocorrelation and heteroscedasticity in modeling financial time series.\n\n")
    report_file.write("7. Recommendations for Future Work:\n")
    report_file.write("- Implement additional regularization techniques, such as DropConnect or Batch Normalization.\n")
    report_file.write("- Explore more advanced architectures, like GRU or bidirectional models.\n")
    report_file.write("- Increase the dataset to improve the models' generalization capacity.\n")
    report_file.write("- Use more robust cross-validation methods to assess model stability.\n")
    report_file.write("- Integrate other features, such as technical indicators or macroeconomic variables, to enrich model inputs.\n")
    report_file.write("- Consider hybrid models that combine Machine Learning techniques with traditional statistical models.\n")
    report_file.write("\nSystem Information and Execution Time:\n")
    report_file.write(json.dumps(system_info, indent=4))
    report_file.write("\n\n")
    report_file.write("End of Report.\n")
Results and Discussion
We begin by reading the stored MSE results:
Code
# Reading the stored results
# 1. Reading the MSE results file

# Define the report directory
report_dir = 'report'

# Path to the MSE results file
mse_results_path = os.path.join(report_dir, 'mse_results_updated.csv')

# Read the CSV file
mse_results = pd.read_csv(mse_results_path, index_col=0)

# Display the DataFrame
print("\nMSE Results of the models:")
print(mse_results)
MSE Results of the models:
                GF=F      KE=F      ZC=F      ZL=F      ZM=F      ZR=F    ZS=F      ZW=F
Best Model  0.001864  0.000412  0.000315  0.001078  0.000333  0.129138  0.0002  0.000413
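Since MSEs on log returns are hard to interpret at a glance, one optional post-processing step (not part of the original report) is converting them to RMSEs, which share the scale of the returns themselves. This sketch hardcodes the MSE values printed above:

```python
import numpy as np
import pandas as pd

# Best-model MSE values reported above, one per series
mse = pd.Series({
    "GF=F": 0.001864, "KE=F": 0.000412, "ZC=F": 0.000315, "ZL=F": 0.001078,
    "ZM=F": 0.000333, "ZR=F": 0.129138, "ZS=F": 0.0002,   "ZW=F": 0.000413,
})

# RMSE = sqrt(MSE), directly comparable to the magnitude of daily log returns
rmse = np.sqrt(mse)
print(rmse.sort_values())
```

On this scale the outlier stands out clearly: the rough rice series (ZR=F) has an RMSE an order of magnitude larger than the other commodities, suggesting its returns were the hardest for the best model to fit.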
Next, we read the file with the best hyperparameters found for each univariate time series model:
Code
# 2. Reading the best parameters file

# Path to the best parameters file
best_params_path = os.path.join(report_dir, 'best_params.json')

# Read the JSON file
with open(best_params_path, 'r') as f:
    best_params = json.load(f)

# Display the best parameters
print("\nBest parameters found for each time series:")
for series, params in best_params.items():
    print(f"{series}: {params}")
# 3. Reading the final report

# Path to the final report
report_path = os.path.join(report_dir, 'final_report.txt')

# Read the report
with open(report_path, 'r') as report_file:
    report_content = report_file.read()

# Display the report
print("\nFinal Report Content:")
print(report_content)
Final Report Content:
Final Project Report - Forecasting Commodity Returns with LSTM
================================================================================
1. Project Objectives:
Forecast future returns of a commodity portfolio using LSTM Neural Networks.
2. Methodology:
- Collecting commodity price data.
- Calculating logarithmic returns.
- Normalizing the data.
- Training LSTM models with different configurations.
- Performing grid search to optimize hyperparameters.
- Conducting residual analysis to identify uncaptured patterns and issues like autocorrelation or heteroscedasticity.
3. Results:
GF=F KE=F ZC=F ZL=F ZM=F ZR=F ZS=F ZW=F
Best Model 0.001864 0.000412 0.000315 0.001078 0.000333 0.129138 0.0002 0.000413
4. Best Parameters Found (Grid Search):
{
"GF=F": {
"activation": "relu",
"batch_size": 64,
"learning_rate": 0.01,
"neurons": 50
},
"KE=F": {
"activation": "relu",
"batch_size": 32,
"learning_rate": 0.01,
"neurons": 30
},
"ZC=F": {
"activation": "relu",
"batch_size": 32,
"learning_rate": 0.01,
"neurons": 50
},
"ZL=F": {
"activation": "relu",
"batch_size": 32,
"learning_rate": 0.01,
"neurons": 30
},
"ZM=F": {
"activation": "tanh",
"batch_size": 32,
"learning_rate": 0.01,
"neurons": 50
},
"ZR=F": {
"activation": "tanh",
"batch_size": 64,
"learning_rate": 0.001,
"neurons": 50
},
"ZS=F": {
"activation": "tanh",
"batch_size": 32,
"learning_rate": 0.01,
"neurons": 50
},
"ZW=F": {
"activation": "tanh",
"batch_size": 64,
"learning_rate": 0.01,
"neurons": 30
}
}
5. Residual Analysis:
Residual Analysis for GF=F:
Ljung-Box Test p-value: 3.114439298300653e-19
Breusch-Pagan Test p-value: 5.572021567735004e-12
Residual Analysis for KE=F:
Ljung-Box Test p-value: 0.006590268761731124
Breusch-Pagan Test p-value: 0.415827361757533
Residual Analysis for ZC=F:
Ljung-Box Test p-value: 1.2706819946105328e-08
Breusch-Pagan Test p-value: 0.8329847721504908
Residual Analysis for ZL=F:
Ljung-Box Test p-value: 4.039168937925409e-11
Breusch-Pagan Test p-value: 0.0004951263920116207
Residual Analysis for ZM=F:
Ljung-Box Test p-value: 0.159790669133358
Breusch-Pagan Test p-value: 0.4065472611473179
Residual Analysis for ZR=F:
Ljung-Box Test p-value: 1.2371562402040067e-10
Breusch-Pagan Test p-value: 4.876451554339863e-08
Residual Analysis for ZS=F:
Ljung-Box Test p-value: 0.0030306191622350835
Breusch-Pagan Test p-value: 0.0002490492748648036
Residual Analysis for ZW=F:
Ljung-Box Test p-value: 0.002700318474036124
Breusch-Pagan Test p-value: 0.5546230357603605
6. Conclusions:
The study demonstrated the importance of proper hyperparameter selection and model architecture for forecasting financial returns. Regularization techniques and the choice of activation function significantly influenced model performance. The residual analysis highlighted the need to consider autocorrelation and heteroscedasticity in modeling financial time series.
7. Recommendations for Future Work:
- Implement additional regularization techniques, such as DropConnect or Batch Normalization.
- Explore more advanced architectures, like GRU or bidirectional models.
- Increase the dataset to improve the models' generalization capacity.
- Use more robust cross-validation methods to assess model stability.
- Integrate other features, such as technical indicators or macroeconomic variables, to enrich model inputs.
- Consider hybrid models that combine Machine Learning techniques with traditional statistical models.
System Information and Execution Time:
{
"CPU_cores": 12,
"CPU_freq_MHz": 1800.0,
"Total_RAM_GB": 31.69,
"Available_RAM_GB": 10.41,
"GPU_info": "Not available",
"Execution_Time_seconds": 616.011476
}
End of Report.
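To summarize the residual diagnostics above, the reported p-values can be screened against a 5% significance level. This is a post-hoc helper using the values printed in the report, not part of the original pipeline:

```python
# (Ljung-Box, Breusch-Pagan) p-values as reported in section 5 of the final report
pvalues = {
    "GF=F": (3.11e-19, 5.57e-12), "KE=F": (6.59e-3, 0.4158),
    "ZC=F": (1.27e-8, 0.8330),    "ZL=F": (4.04e-11, 4.95e-4),
    "ZM=F": (0.1598, 0.4065),     "ZR=F": (1.24e-10, 4.88e-8),
    "ZS=F": (3.03e-3, 2.49e-4),   "ZW=F": (2.70e-3, 0.5546),
}
alpha = 0.05

# A series "passes" if neither test rejects its null at the 5% level,
# i.e. no detected residual autocorrelation and no heteroscedasticity
clean = [s for s, (lb, bp) in pvalues.items() if lb > alpha and bp > alpha]
print(clean)
```

Only ZM=F (soybean meal) passes both tests, meaning every other best model leaves autocorrelation and/or heteroscedasticity in its residuals, which motivates the conclusions drawn in section 6.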
Finally, we display the saved MSE comparison and residual graphs:
Code
# 4. Viewing the graphs
from IPython.display import Image, display

# List of time series
series_list = returns.columns

# Display MSE comparison graphs
for col in series_list:
    image_path = os.path.join(report_dir, f'mse_comparison_{col}.png')
    if os.path.exists(image_path):
        display(Image(filename=image_path))
    else:
        print(f"Graph {image_path} not found.")

# Display residuals graphs
for col in series_list:
    residuals_image_path = os.path.join(report_dir, f'residuals_{col}_Best_Model.png')
    acf_image_path = os.path.join(report_dir, f'acf_residuals_{col}_Best_Model.png')
    if os.path.exists(residuals_image_path):
        display(Image(filename=residuals_image_path))
    else:
        print(f"Graph {residuals_image_path} not found.")
    if os.path.exists(acf_image_path):
        display(Image(filename=acf_image_path))
    else:
        print(f"Graph {acf_image_path} not found.")
# End of code
Conclusions
References
Code
#| eval: false
# Total timing to compile this Quarto document
end_time = datetime.now()
time_diff = end_time - start_time
print(f"Total Quarto document compiling time: {time_diff}")
Total Quarto document compiling time: 0:10:16.946894